63 research outputs found

    Using Intelligent Prefetching to Reduce the Energy Consumption of a Large-scale Storage System

    Get PDF
    Many high performance large-scale storage systems will experience significant workload increases as their user base and content availability grow over time. The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) center hosts one such system that has recently undergone a period of rapid growth as its user population grew nearly 400% in just about three years. When administrators of these massive storage systems face the challenge of meeting the demands of an ever increasing number of requests, the easiest solution is to integrate more advanced hardware to existing systems. However, additional investment in hardware may significantly increase the system cost as well as daily power consumption. In this paper, we present evidence that well-selected software level optimization is capable of achieving comparable levels of performance without the cost and power consumption overhead caused by physically expanding the system. Specifically, we develop intelligent prefetching algorithms that are suitable for the unique workloads and user behaviors of the world\u27s largest satellite images distribution system managed by USGS EROS. Our experimental results, derived from real-world traces with over five million requests sent by users around the globe, show that the EROS hybrid storage system could maintain the same performance with over 30% of energy savings by utilizing our proposed prefetching algorithms, compared to the alternative solution of doubling the size of the current FTP server farm

    \u3cem\u3eHP-DAEMON\u3c/em\u3e: \u3cem\u3eH\u3c/em\u3eigh \u3cem\u3eP\u3c/em\u3eerformance \u3cem\u3eD\u3c/em\u3eistributed \u3cem\u3eA\u3c/em\u3edaptive \u3cem\u3eE\u3c/em\u3energy-efficient \u3cem\u3eM\u3c/em\u3eatrix-multiplicati\u3cem\u3eON\u3c/em\u3e

    Get PDF
    The demands of improving energy efficiency for high performance scientific applications arise crucially nowadays. Software-controlled hardware solutions directed by Dynamic Voltage and Frequency Scaling (DVFS) have shown their effectiveness extensively. Although DVFS is beneficial to green computing, introducing DVFS itself can incur non-negligible overhead, if there exist a large number of frequency switches issued by DVFS. In this paper, we propose a strategy to achieve the optimal energy savings for distributed matrix multiplication via algorithmically trading more computation and communication at a time adaptively with user-specified memory costs for less DVFS switches, which saves 7.5% more energy on average than a classic strategy. Moreover, we leverage a high performance communication scheme for fully exploiting network bandwidth via pipeline broadcast. Overall, the integrated approach achieves substantial energy savings (up to 51.4%) and performance gain (28.6% on average) compared to ScaLAPACK pdgemm() on a cluster with an Ethernet switch, and outperforms ScaLAPACK and DPLASMA pdgemm() respectively by 33.3% and 32.7% on average on a cluster with an Infiniband switch

    Network Binarization via Contrastive Learning

    Full text link
    Neural network binarization accelerates deep models by quantizing their weights and activations into 1-bit. However, there is still a huge performance gap between Binary Neural Networks (BNNs) and their full-precision (FP) counterparts. As the quantization error caused by weights binarization has been reduced in earlier works, the activations binarization becomes the major obstacle for further improvement of the accuracy. BNN characterises a unique and interesting structure, where the binary and latent FP activations exist in the same forward pass (i.e., Binarize(aF)=aB\text{Binarize}(\mathbf{a}_F) = \mathbf{a}_B). To mitigate the information degradation caused by the binarization operation from FP to binary activations, we establish a novel contrastive learning framework while training BNNs through the lens of Mutual Information (MI) maximization. MI is introduced as the metric to measure the information shared between binary and FP activations, which assists binarization with contrastive learning. Specifically, the representation ability of the BNNs is greatly strengthened via pulling the positive pairs with binary and FP activations from the same input samples, as well as pushing negative pairs from different samples (the number of negative pairs can be exponentially large). This benefits the downstream tasks, not only classification but also segmentation and depth estimation, etc. The experimental results show that our method can be implemented as a pile-up module on existing state-of-the-art binarization methods and can remarkably improve the performance over them on CIFAR-10/100 and ImageNet, in addition to the great generalization ability on NYUD-v2.Comment: Accepted to ECCV 202

    Lipschitz Continuity Retained Binary Neural Network

    Full text link
    Relying on the premise that the performance of a binary neural network can be largely restored with eliminated quantization error between full-precision weight vectors and their corresponding binary vectors, existing works of network binarization frequently adopt the idea of model robustness to reach the aforementioned objective. However, robustness remains to be an ill-defined concept without solid theoretical support. In this work, we introduce the Lipschitz continuity, a well-defined functional property, as the rigorous criteria to define the model robustness for BNN. We then propose to retain the Lipschitz continuity as a regularization term to improve the model robustness. Particularly, while the popular Lipschitz-involved regularization methods often collapse in BNN due to its extreme sparsity, we design the Retention Matrices to approximate spectral norms of the targeted weight matrices, which can be deployed as the approximation for the Lipschitz constant of BNNs without the exact Lipschitz constant computation (NP-hard). Our experiments prove that our BNN-specific regularization method can effectively strengthen the robustness of BNN (testified on ImageNet-C), achieving state-of-the-art performance on CIFAR and ImageNet.Comment: Paper accepted to ECCV 202

    Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation

    Full text link
    Methods for improving the efficiency of deep network training (i.e. the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or improve model quality, but it's unclear if distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, using common enterprise hardware (8x NVIDIA A100). We found that distillation can speed up training by up to 1.96x in ResNet-50 trained on ImageNet and up to 1.42x on BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal results when it is only performed for the first 20-50% of training. We also observed that training with distillation is almost always more efficient than training without distillation, even when using the poorest-quality model as a teacher, in both ResNet-50 and BERT. Finally, we found that it's possible to gain the benefit of distilling from an ensemble of teacher models, which has O(n) runtime cost, by randomly sampling a single teacher from the pool of teacher models on each step, which only has a O(1) runtime cost. Taken together, these results show that distillation can substantially improve training efficiency in both image classification and language modeling, and that a few simple optimizations to distillation protocols can further enhance these efficiency improvements

    Improving write performance by enhancing internal parallelism of Solid State Drives

    Full text link
    Abstract—Most researches of Solid State Drives (SSDs) archi-tectures rely on Flash Translation Layer (FTL) algorithms and wear-leveling; however, internal parallelism in Solid State Drives has not been well explored. In this research, we proposed a new strategy to improve SSD write performance by enhancing internal parallelism inside SSDs. A SDRAM buffer is added in the design for buffering and scheduling write requests. Because the same logical block numbers may be translated to different physical numbers at different times in FTL, the on-board SDRAM buffer is used to buffer requests at the lower level of FTL. When the buffer is full, same amount of data will be assigned to each storage package in SSDs to enhance internal parallelism. To accurately evaluate performance, we use both synthetic workloads and real-world applications in experiments. We compare the enhanced internal parallelism scheme with the traditional LRU strategy since it is unfair to compare an SSD having buffer with an SSD without a buffer. The simulation results demonstrate that the writing performance of our design is significantly improved compared with the LRU-cache strategy with the same amount of buffer sizes. I

    Real-time Monitoring for the Next Core-Collapse Supernova in JUNO

    Full text link
    Core-collapse supernova (CCSN) is one of the most energetic astrophysical events in the Universe. The early and prompt detection of neutrinos before (pre-SN) and during the SN burst is a unique opportunity to realize the multi-messenger observation of the CCSN events. In this work, we describe the monitoring concept and present the sensitivity of the system to the pre-SN and SN neutrinos at the Jiangmen Underground Neutrino Observatory (JUNO), which is a 20 kton liquid scintillator detector under construction in South China. The real-time monitoring system is designed with both the prompt monitors on the electronic board and online monitors at the data acquisition stage, in order to ensure both the alert speed and alert coverage of progenitor stars. By assuming a false alert rate of 1 per year, this monitoring system can be sensitive to the pre-SN neutrinos up to the distance of about 1.6 (0.9) kpc and SN neutrinos up to about 370 (360) kpc for a progenitor mass of 30M⊙M_{\odot} for the case of normal (inverted) mass ordering. The pointing ability of the CCSN is evaluated by using the accumulated event anisotropy of the inverse beta decay interactions from pre-SN or SN neutrinos, which, along with the early alert, can play important roles for the followup multi-messenger observations of the next Galactic or nearby extragalactic CCSN.Comment: 24 pages, 9 figure
    • …
    corecore